Normalize executor and utility query text with $N placeholders by iskakaushik · Pull Request #51 · ClickHouse/pg_stat_ch

iskakaushik · 2026-04-03T01:59:42Z

Summary

add post_parse_analyze_hook query normalization based on PostgreSQL's JumbleState, porting generate_normalized_query / fill_in_constant_lengths from pg_stat_statements
carry normalized text to ExecutorEnd / ProcessUtility using exact statement-scoped state, including nested SPI reuse and multi-statement handling
scope normalization out of emit_log_hook; error events are still captured, but event.query is left empty instead of doing fuzzy reconstruction
fix duplicate escape-string warnings, nested normalization regressions, and PG16/17 build compatibility
add TAP coverage for normalization behavior, failed-statement isolation, warning behavior, and nested SPI reuse

Test plan

Builds against PG16, PG17, and PG18
./scripts/run-tests.sh ../postgres/install_tap tap 008
./scripts/run-tests.sh ../postgres/install_tap tap 010
./scripts/run-tests.sh ../postgres/install_tap tap 022
./scripts/run-tests.sh ../postgres/install_tap tap 027
./scripts/run-tests.sh ../postgres/install_tap tap 028
./scripts/run-tests.sh ../postgres/install_tap tap 029
./scripts/run-tests.sh ../postgres/install_tap tap 030

Port pg_stat_statements' query normalization to pg_stat_ch so that captured queries never contain raw literal values (passwords, PII, tokens, etc.). Both ClickHouse and OTel exporters now receive normalized text like: SELECT * FROM users WHERE email = $1 AND age > $2 Instead of: SELECT * FROM users WHERE email = 'alice@example.com' AND age > 30 Implementation: - Add post_parse_analyze_hook to capture JumbleState at parse time - Port generate_normalized_query / fill_in_constant_lengths from pg_stat_statements (static functions, not exported by PG) - Store normalized text per-backend in TopMemoryContext, consumed by ExecutorEnd/ProcessUtility/emit_log_hook - Schema names, table names, column names, operators, and SQL structure are preserved; only literal constants are replaced Includes TAP test (027_query_normalization.pl) with 16 subtests covering strings, numbers, negatives, floats, IN lists, LIKE, BETWEEN, subqueries, multi-statement, INSERT/UPDATE/DELETE, and schema-qualified names.

Copilot

Pull request overview

This PR introduces literal-constant query normalization (replacing strings/numbers/etc. with $N placeholders) so that exported query text to ClickHouse and OpenTelemetry never contains raw parameter values, leveraging PostgreSQL’s JumbleState from post_parse_analyze_hook.

Changes:

Add a post_parse_analyze_hook to generate and stash a normalized query string during parse/analyze.
Port/implement pg_stat_statements-style query normalization utilities (PschNormalizeQuery) to replace constants with $N.
Add a TAP test suite (t/027_query_normalization.pl) covering a variety of constant forms and SQL shapes.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
t/027_query_normalization.pl	Adds integration coverage validating that exported query text contains placeholders and not raw literals.
src/hooks/query_normalize.h	Declares the normalization API used by hooks to parameterize query text.
src/hooks/query_normalize.cc	Implements normalization logic (ported from pg_stat_statements approach) to rewrite constants into `$N`.
src/hooks/hooks.cc	Installs/uses `post_parse_analyze_hook` and prefers normalized query text when populating events (including error events).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-03T02:07:17Z

src/hooks/hooks.cc

+// post_parse_analyze_hook — normalize query text at parse time.
+// The JumbleState (with constant locations) is only available here, so we
+// must generate the normalized text now and stash it for ExecutorEnd.
+static void PschPostParseAnalyze(ParseState* pstate, Query* query, JumbleState* jstate) {
+  if (prev_post_parse_analyze != nullptr) {
+    prev_post_parse_analyze(pstate, query, jstate);
+  }
+
+  // Only normalize if enabled and the query has constants to replace
+  if (!psch_enabled || jstate == nullptr || jstate->clocations_count <= 0) {
+    return;
+  }
+
+  // Free any stale normalized text from a previous parse in this backend
+  ClearNormalizedQuery();
+
+  const char* query_text = pstate->p_sourcetext;
+  int query_loc = query->stmt_location;
+  int query_len = query->stmt_len;
+
+  // CleanQuerytext extracts the relevant portion for multi-statement strings
+  query_text = CleanQuerytext(query_text, &query_loc, &query_len);
+
+  // Allocate in TopMemoryContext so the normalized text survives until
+  // ExecutorEnd consumes it (the parse-time context is short-lived).
+  MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext);
+  normalized_query = PschNormalizeQuery(query_text, query_loc, &query_len, jstate);
+  MemoryContextSwitchTo(oldcxt);
+  normalized_query_len = query_len;
+}


normalized_query is a single per-backend buffer that can become stale or be overwritten before it is consumed. In particular: (1) PschPostParseAnalyze() can run for statements that later hit the early-return path in PschExecutorEnd() (e.g. queryId == 0), leaving normalized_query set; the next captured query will then incorrectly use this stale normalized text. (2) nested statements (SPI / functions) can overwrite normalized_query before the outer statement reaches ExecutorEnd, causing the outer event to lose normalization or pick up the inner query’s text. Consider storing normalized text in a stack keyed by nesting_level, or in a small per-backend map keyed by queryId/statement identity, and ensure the entry is removed on all completion paths (including the PschExecutorEnd() early return and utility paths).

Copilot · 2026-04-03T02:07:17Z

src/hooks/hooks.cc

+  // Use normalized query if available; fall back to debug_query_string
+  if (normalized_query != nullptr) {
+    size_t len = Min(static_cast<size_t>(normalized_query_len), sizeof(event.query) - 1);
+    memcpy(event.query, normalized_query, len);
+    event.query[len] = '\0';
+    event.query_len = static_cast<uint16>(len);
+  } else if (debug_query_string != nullptr && debug_query_string[0] != '\0') {
+    event.query_len =
+        static_cast<uint16>(CopyTrimmed(event.query, PSCH_MAX_QUERY_LEN, debug_query_string));
  }


CaptureLogEvent() copies from normalized_query but never frees/clears it. If an error occurs before the normal query event is built (or if the statement never reaches ExecutorEnd), this leaks TopMemoryContext memory and can cause later events to reuse the wrong normalized text. After copying into event.query, clear the stored normalized buffer (or pop the current entry if switching to a per-statement stack/map).

Copilot · 2026-04-03T02:07:18Z

src/hooks/query_normalize.cc

+  // A constant is at least 1 byte; $N is at most 11 bytes (INT_MAX).
+  norm_query_buflen = query_len + (jstate->clocations_count * 10);
+  norm_query = static_cast<char*>(palloc(norm_query_buflen + 1));
+
+  for (int i = 0; i < jstate->clocations_count; i++) {
+    int off;
+    int tok_len;
+
+    // If we have an external param at this location but no squashed lists,
+    // skip it so the original $N text is preserved.
+    if (jstate->clocations[i].extern_param && !jstate->has_squashed_lists) {
+      continue;
+    }
+
+    off = jstate->clocations[i].location;
+    off -= query_loc;
+    tok_len = jstate->clocations[i].length;
+
+    if (tok_len < 0) {
+      continue;  // duplicate, ignore
+    }
+
+    // Copy the chunk between last constant and this one
+    len_to_wrt = off - last_off - last_tok_len;
+    Assert(len_to_wrt >= 0);
+    memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
+    n_quer_loc += len_to_wrt;
+
+    // Insert $N placeholder (and squashed-list comment if applicable)
+    n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d%s",
+                          num_constants_replaced + 1 + jstate->highest_extern_param_id,
+                          jstate->clocations[i].squashed ? " /*, ... */" : "");
+    num_constants_replaced++;
+
+    quer_loc = off + tok_len;


norm_query_buflen sizing (query_len + clocations_count * 10) is not a guaranteed upper bound once you account for the optional squashed-list suffix (" /*, ... */") and larger $N values. Combined with the use of sprintf(), this can lead to a write past the end of norm_query before the later Assert() triggers. Use a dynamically-growing buffer (e.g., StringInfo) or compute a strict upper bound and write with snprintf() using the remaining capacity, erroring out safely if it would overflow.

Copilot · 2026-04-03T02:07:18Z

t/027_query_normalization.pl

+);
+
+# Helper: run a query, flush, and return the captured query text from ClickHouse.
+# Uses queryid to identify the specific query.


The helper comment says it "Uses queryid to identify the specific query", but get_captured_query() actually selects the most recent non-extension query by timestamp (no queryid filtering). Either update the comment to match the implementation or add a queryid filter (e.g., capture pg_stat_activity.query_id/pg_query_id() equivalent) to make the selection deterministic when multiple statements are exported close together.

Suggested change

# Uses queryid to identify the specific query.

# Clears prior exported rows and returns the most recent non-extension query by timestamp.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-03T03:25:22Z

src/hooks/query_normalize.cc

+static void FillInConstantLengths(JumbleState* jstate, const char* query, int query_loc) {
+  LocationLen* locs;
+  core_yyscan_t yyscanner;
+  core_yy_extra_type yyextra;


core_yy_extra_type yyextra is uninitialized when passed to scanner_init(). If scanner_init() doesn’t fully overwrite all fields, this risks undefined behavior. Consider zero-initializing yyextra before calling scanner_init().

Suggested change

core_yy_extra_type yyextra;

core_yy_extra_type yyextra = {};

Copilot · 2026-04-03T03:25:22Z

src/hooks/hooks.cc

+// The normalized registry is keyed by statement identity and reused across
+// repeated executions of cached plans, so this helper first looks up the
+// normalized entry stashed at parse time. If no match exists, it falls back to
+// CopyRawStatementText, which preserves the literal SQL text for the current
+// statement only.
+static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location,
+                          int stmt_len) {
+  if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query,
+                                          sizeof(event->query), &event->query_len, query_text,
+                                          stmt_location, stmt_len, false)) {


CopyQueryText() uses consume=false when calling PschCopyNormalizedQueryForStatement(), so normalized entries remembered at parse time are never removed on the successful path. Since those entries live in TopMemoryContext and are keyed by raw source_text, sessions that execute many literal-variant statements can accumulate entries and grow backend memory without bound. Consider consuming/removing entries after enqueue (when safe) and/or adding an eviction/size cap (or keying by queryId) to prevent unbounded growth.

Suggested change

// The normalized registry is keyed by statement identity and reused across

// repeated executions of cached plans, so this helper first looks up the

// normalized entry stashed at parse time. If no match exists, it falls back to

// CopyRawStatementText, which preserves the literal SQL text for the current

// statement only.

static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location,

int stmt_len) {

if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query,

sizeof(event->query), &event->query_len, query_text,

stmt_location, stmt_len, false)) {

// The normalized entry is only needed until it has been copied into the event

// buffer, so consume it on a successful lookup to avoid unbounded retention in

// the per-backend registry. If no match exists, fall back to the raw statement

// text for the current statement only.

static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location,

int stmt_len) {

if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query,

sizeof(event->query), &event->query_len, query_text,

stmt_location, stmt_len, true)) {

Copilot · 2026-04-03T03:25:22Z

src/hooks/hooks.cc

+  if (PschCopyNormalizedQueryForLog(&backend_state.normalized_queries, event.query,
+                                    sizeof(event.query), &event.query_len, debug_query_string,
+                                    pgstat_get_my_query_id(), edata->cursorpos)) {
+  } else if (debug_query_string != nullptr && debug_query_string[0] != '\0') {
+    event.query_len =


If PschCopyNormalizedQueryForLog() can’t find a match, the code falls back to copying debug_query_string verbatim into the event, which can still include raw literals (passwords/PII). This contradicts the PR description’s claim that “no raw parameter values are ever captured”. If the intent is a hard guarantee, consider emitting an empty/"" query (or gating raw capture behind a GUC) when normalization lookup fails.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-03T04:10:05Z

src/hooks/query_normalize_state.cc

+  PschNormalizedQueryEntry* next;
+};
+
+PschStatementKey PschMakeStatementKey(const char* source_text, int stmt_location, int stmt_len) {


SourceTextMatch() treats pointer equality (entry->source_text == source_text) as an unconditional match. Postgres can reuse memory addresses for new query strings across statements (e.g., after memory context resets), so pointer equality alone can incorrectly match a different statement and return the wrong normalized query. Consider removing the pointer fast-path, or at least validating length/content even when pointers match (e.g., compare lengths and memcmp against source_text_copy).

Suggested change

PschNormalizedQueryEntry* next;

};

PschStatementKey PschMakeStatementKey(const char* source_text, int stmt_location, int stmt_len) {

Copilot · 2026-04-03T04:10:05Z

src/hooks/hooks.cc

+static void CopyQueryText(PschEvent* event, const char* query_text, int stmt_location,
+                          int stmt_len) {
+  const PschStatementKey statement_key = PschMakeStatementKey(query_text, stmt_location, stmt_len);
+  if (PschCopyNormalizedQueryForStatement(&backend_state.normalized_queries, event->query,
+                                          sizeof(event->query), &event->query_len, statement_key,
+                                          false)) {
+    return;
  }
+
+  CopyRawStatementText(event, query_text, stmt_location, stmt_len);


Normalized query entries are always copied with consume=false, and successful executor/utility paths never call ForgetNormalizedStatement(). This makes the per-backend normalized query registry grow without bound over long-lived sessions that execute many distinct statements, which can lead to backend memory bloat. Consider consuming entries by default after they’re used (and only retaining them for known-reuse cases like cached plans/SPI), or enforcing an upper bound/eviction strategy for the registry.

Copilot AI review requested due to automatic review settings April 3, 2026 01:59

Copilot started reviewing on behalf of iskakaushik April 3, 2026 02:00 View session

Fix clang-format violations in query_normalize.cc

af6b557

Copilot AI reviewed Apr 3, 2026

View reviewed changes

iskakaushik added 2 commits April 2, 2026 21:55

Fix normalization state handling and harden TAP waits

3c4aeaf

Fix nested query normalization reuse

3170383

Copilot AI review requested due to automatic review settings April 3, 2026 03:19

Copilot started reviewing on behalf of iskakaushik April 3, 2026 03:19 View session

Copilot AI reviewed Apr 3, 2026

View reviewed changes

iskakaushik added 2 commits April 2, 2026 22:35

Add PG16/17 normalization compatibility

10979da

Scope query normalization out of error capture

eb15e83

Copilot AI review requested due to automatic review settings April 3, 2026 04:04

iskakaushik changed the title ~~Normalize query text: replace constants with $N placeholders~~ Normalize executor and utility query text with $N placeholders Apr 3, 2026

Copilot started reviewing on behalf of iskakaushik April 3, 2026 04:04 View session

Hash normalized statement identity lookups

ed3b62f

Copilot AI reviewed Apr 3, 2026

View reviewed changes

Use statement key in query text copy helpers

01477e5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Normalize executor and utility query text with $N placeholders#51

Normalize executor and utility query text with $N placeholders#51
iskakaushik wants to merge 8 commits intomainfrom
feat/query-normalization

iskakaushik commented Apr 3, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Copilot AI Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	# Uses queryid to identify the specific query.
	# Clears prior exported rows and returns the most recent non-extension query by timestamp.

Conversation

iskakaushik commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

iskakaushik commented Apr 3, 2026 •

edited

Loading